Credit Card Users Churn Prediction¶

Problem Statement¶

Business Context¶

The Thera bank recently saw a steep decline in the number of users of their credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand the reasons why, so that the bank can improve in those areas.

As a data scientist at Thera bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?¶

  • If we don't pay the balance of a revolving credit account in full every month, the unpaid portion carries over to the next month. That carried-over amount is called a revolving balance.
What is the Average Open to buy?¶
  • 'Open to Buy' is the amount left on the credit card to spend. This column represents the average of that value over the last 12 months.
What is the Average utilization Ratio?¶
  • The Avg_Utilization_Ratio represents how much of the available credit the customer has used. It is an input to credit score calculations.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:¶
  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
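This identity can be sanity-checked with made-up numbers (assuming the averages are taken over the same statements, so that Avg_Open_To_Buy equals Credit_Limit minus Total_Revolving_Bal):

```python
# Illustrative figures only, not taken from the dataset
credit_limit = 10000.0
total_revolving_bal = 2500.0                          # unpaid balance carried over
avg_open_to_buy = credit_limit - total_revolving_bal  # credit still available to spend
avg_utilization_ratio = total_revolving_bal / credit_limit

# The used and unused shares of the credit line sum to 1
assert abs(avg_open_to_buy / credit_limit + avg_utilization_ratio - 1) < 1e-9
```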

Please read the instructions carefully before starting the project.¶

This is a commented Jupyter (IPython) notebook file containing all the instructions and tasks to be performed.

  • Blanks '___' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '___' blank, there is a comment that briefly describes what needs to be filled in.
  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
  • Please run the code cells sequentially from the beginning to avoid unnecessary errors.
  • Add the results/observations (wherever mentioned) derived from the analysis to the presentation and submit the same.

Importing necessary libraries¶

In [1]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
In [2]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [3]:
# This will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer
from sklearn import metrics

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the dataset¶

In [4]:
bank = pd.read_csv("BankChurners.csv")
In [5]:
bank.shape
Out[5]:
(10127, 21)

The dataset has 10127 rows and 21 columns

Data Overview¶

  • Observations
  • Sanity checks
In [6]:
data = bank.copy()
In [7]:
data.head()
Out[7]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [8]:
data.tail()
Out[8]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189
In [9]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Duplicate value¶

In [10]:
# check for duplicate values in the data
data.duplicated().sum()
Out[10]:
0

Missing value¶

In [11]:
# percentage of missing values in each column
round(data.isnull().sum() / data.isnull().count() * 100, 2)
Out[11]:
CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64

Education_Level has 15% missing values.

Marital_Status has 7.4% missing values.

None of the other columns have any missing values.
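One common way to handle these gaps (a sketch, not necessarily the approach taken later in this notebook) is mode imputation, shown here on a toy frame standing in for the two affected columns:

```python
import pandas as pd

# Toy data standing in for Education_Level and Marital_Status
toy = pd.DataFrame({
    "Education_Level": ["Graduate", None, "High School", "Graduate"],
    "Marital_Status": ["Married", "Single", None, "Married"],
})

# Fill each column's missing values with its most frequent category
for col in ["Education_Level", "Marital_Status"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
```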

In [12]:
data.describe().T
Out[12]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.000 739177606.334 36903783.450 708082083.000 713036770.500 717926358.000 773143533.000 828343083.000
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Dependent_count 10127.000 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999

CLIENTNUM is a unique identifier with no statistical importance, so it can be dropped.

Customer age ranges from 26 to 73 years.

The longest customer relationship with the bank is 56 months.

At least 50% of the customers have 2 or more dependents.

In [13]:
# categorical variables
cat_col = data.select_dtypes(include="object").columns.tolist()
In [14]:
data.drop(["CLIENTNUM"], axis=1, inplace=True)  # Drop CLIENTNUM as it is unique per customer and has no relation to the target variable.
In [15]:
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
In [16]:
#create copy of data
data1 = data.copy()

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. How is the total transaction amount distributed?
  2. What is the distribution of the level of education of customers?
  3. What is the distribution of the level of income of customers?
  4. How does the change in transaction amount between Q4 and Q1 (Total_Amt_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
  5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
  6. What are the attributes that have a strong correlation with each other?

The below functions need to be defined to carry out the Exploratory Data Analysis.¶

In [17]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [18]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [19]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [20]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate Analysis¶

In [21]:
num_col_sel = data.select_dtypes(include=np.number).columns.tolist()

for item in num_col_sel:
    histogram_boxplot(data, item)

The mean credit limit is greater than 5000, but the median is less than 5000.

The median for Customer Age is 46.

Attrition_Flag¶

In [22]:
data['Attrition_Flag'].nunique()
Out[22]:
2
In [23]:
data['Attrition_Flag'].value_counts()
Out[23]:
0    8500
1    1627
Name: Attrition_Flag, dtype: int64

8500 customers still have an account, while 1627 customers have closed their accounts.
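A quick check of the class balance from these counts shows why the imbalanced-learn imports (SMOTE, RandomUnderSampler) appear at the top of the notebook; a minimal sketch:

```python
# Class counts taken from the value_counts() output above
existing, attrited = 8500, 1627
attrition_rate = attrited / (existing + attrited)

print(f"Attrition rate: {attrition_rate:.1%}")  # roughly 16% of customers churned
```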

In [24]:
labeled_barplot(data,'Attrition_Flag', perc=True)

Customer_Age¶

In [25]:
data['Customer_Age'].nunique()
Out[25]:
45
In [26]:
data['Customer_Age'].value_counts()
Out[26]:
44    500
49    495
46    490
45    486
47    479
43    473
48    472
50    452
42    426
51    398
53    387
41    379
52    376
40    361
39    333
54    307
38    303
55    279
56    262
37    260
57    223
36    221
35    184
59    157
58    157
34    146
33    127
60    127
32    106
65    101
61     93
62     93
31     91
26     78
30     70
63     65
29     56
64     43
27     32
28     29
67      4
66      2
68      2
70      1
73      1
Name: Customer_Age, dtype: int64

The mode of Customer_Age is 44.

In [27]:
sns.boxplot(data=data,x='Customer_Age')
#Boxplot to show the distribution of Customer_Age
Out[27]:
<Axes: xlabel='Customer_Age'>

More than 50% of the clients are under 50 years old. There are a few outliers on the upper end.

Gender¶

In [28]:
data['Gender'].nunique()
Out[28]:
2
In [29]:
data['Gender'].value_counts()
Out[29]:
F    5358
M    4769
Name: Gender, dtype: int64
In [30]:
labeled_barplot(data,'Gender', perc=True)
Observation¶

52.9% of the clients are female.

47.1% of the clients are male.

Dependent_count¶

In [31]:
data['Dependent_count'].nunique()
Out[31]:
6
In [32]:
data['Dependent_count'].value_counts()
Out[32]:
3    2732
2    2655
1    1838
4    1574
0     904
5     424
Name: Dependent_count, dtype: int64
In [33]:
labeled_barplot(data,'Dependent_count', perc=True)

More than 50% of the customers have at least 2 dependents.

About 91% of the customers have at least one dependent.

Education_Level¶

In [34]:
data['Education_Level'].nunique()
Out[34]:
6
In [35]:
data['Education_Level'].value_counts()
Out[35]:
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
In [36]:
labeled_barplot(data,'Education_Level', perc=True)

30.9% are graduates.

Only 4.5% have a doctorate.

About 50% have at least a college-level education.

Marital_Status¶

In [37]:
data['Marital_Status'].nunique()
Out[37]:
3
In [38]:
data['Marital_Status'].value_counts()
Out[38]:
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
In [159]:
labeled_barplot(data,'Marital_Status', perc=True)

46.3% of the customers are married.

Only 7.4% of the customers are divorced.

Income_Category¶

In [40]:
data['Income_Category'].nunique()
Out[40]:
6
In [41]:
data['Income_Category'].value_counts()
Out[41]:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
In [42]:
labeled_barplot(data,'Income_Category', perc=True)

35.2% of the customers earn less than $40K.

There is an anomalous value "abc", which accounts for about 11% of the values.
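Since "abc" is clearly a placeholder rather than a real income bracket, one option (a sketch with a toy series, not a prescription) is to treat it as missing so it can be handled alongside the other gaps:

```python
import numpy as np
import pandas as pd

# Toy series standing in for Income_Category
income = pd.Series(["Less than $40K", "abc", "$120K +", "abc"])

# Treat the placeholder as a missing value
income = income.replace("abc", np.nan)
```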

Card_Category¶

In [43]:
data['Card_Category'].nunique()
Out[43]:
4
In [44]:
data['Card_Category'].value_counts()
Out[44]:
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
In [45]:
labeled_barplot(data,'Card_Category', perc=True)

93.2% of the customers have Blue cards.

Only 0.2% have Platinum cards.

Months_on_book¶

In [46]:
data['Months_on_book'].nunique()
Out[46]:
44
In [47]:
data['Months_on_book'].value_counts()
Out[47]:
36    2463
37     358
34     353
38     347
39     341
40     333
31     318
35     317
33     305
30     300
41     297
32     289
28     275
43     273
42     271
29     241
44     230
45     227
27     206
46     197
26     186
47     171
25     165
48     162
24     160
49     141
23     116
22     105
56     103
50      96
21      83
51      80
53      78
20      74
13      70
19      63
52      62
18      58
54      53
55      42
17      39
15      34
16      29
14      16
Name: Months_on_book, dtype: int64
In [48]:
sns.boxplot(data=data,x='Months_on_book')
#Boxplot to show the distribution of Months_on_book
Out[48]:
<Axes: xlabel='Months_on_book'>

More than 75% have been with the bank for at least 30 months.

Total_Relationship_Count¶

In [49]:
data['Total_Relationship_Count'].nunique()
Out[49]:
6
In [50]:
data['Total_Relationship_Count'].value_counts()
Out[50]:
3    2305
4    1912
5    1891
6    1866
2    1243
1     910
Name: Total_Relationship_Count, dtype: int64
In [51]:
labeled_barplot(data,'Total_Relationship_Count', perc=True)

22.8% have held 3 products.

Months_Inactive_12_mon¶

In [52]:
data['Months_Inactive_12_mon'].nunique()
Out[52]:
7
In [53]:
data['Months_Inactive_12_mon'].value_counts()
Out[53]:
3    3846
2    3282
1    2233
4     435
5     178
6     124
0      29
Name: Months_Inactive_12_mon, dtype: int64
In [161]:
sns.boxplot(data=data,x='Months_Inactive_12_mon')
#Boxplot to show the distribution of Months_Inactive_12_mon
Out[161]:
<Axes: xlabel='Months_Inactive_12_mon'>

50% of the customers have been inactive for 2-3 months. Only 29 customers were never inactive.

Contacts_Count_12_mon¶

In [55]:
data['Contacts_Count_12_mon'].nunique()
Out[55]:
7
In [56]:
data['Contacts_Count_12_mon'].value_counts()
Out[56]:
3    3380
2    3227
1    1499
4    1392
0     399
5     176
6      54
Name: Contacts_Count_12_mon, dtype: int64
In [160]:
sns.boxplot(data=data,x='Contacts_Count_12_mon')
#Boxplot to show the distribution of Contacts_Count_12_mon
Out[160]:
<Axes: xlabel='Contacts_Count_12_mon'>

50% had 2-3 contacts.

Credit_Limit¶

In [58]:
data['Credit_Limit'].nunique()
Out[58]:
6205
In [59]:
data['Credit_Limit'].value_counts()
Out[59]:
34516.000    508
1438.300     507
9959.000      18
15987.000     18
23981.000     12
            ... 
9183.000       1
29923.000      1
9551.000       1
11558.000      1
10388.000      1
Name: Credit_Limit, Length: 6205, dtype: int64
In [60]:
sns.boxplot(data=data,x='Credit_Limit')
#Boxplot to show the distribution of Credit_Limit
Out[60]:
<Axes: xlabel='Credit_Limit'>

50% of the customers have a credit limit below 5000, but there are many outliers on the upper end.

Total_Revolving_Bal¶

In [61]:
data['Total_Revolving_Bal'].nunique()
Out[61]:
1974
In [62]:
data['Total_Revolving_Bal'].value_counts()
Out[62]:
0       2470
2517     508
1965      12
1480      12
1434      11
        ... 
2467       1
2131       1
2400       1
2144       1
2241       1
Name: Total_Revolving_Bal, Length: 1974, dtype: int64
In [63]:
sns.boxplot(data=data,x='Total_Revolving_Bal')
#Boxplot to show the distribution of Total_Revolving_Bal
Out[63]:
<Axes: xlabel='Total_Revolving_Bal'>

Avg_Open_To_Buy¶

In [64]:
data['Avg_Open_To_Buy'].nunique()
Out[64]:
6813
In [65]:
data['Avg_Open_To_Buy'].value_counts()
Out[65]:
1438.300     324
34516.000     98
31999.000     26
787.000        8
701.000        7
            ... 
6543.000       1
2808.000       1
21549.000      1
6189.000       1
8427.000       1
Name: Avg_Open_To_Buy, Length: 6813, dtype: int64
In [66]:
sns.boxplot(data=data,x='Avg_Open_To_Buy')
#Boxplot to show the distribution of Avg_Open_To_Buy
Out[66]:
<Axes: xlabel='Avg_Open_To_Buy'>

Total_Amt_Chng_Q4_Q1¶

In [67]:
data['Total_Amt_Chng_Q4_Q1'].nunique()
Out[67]:
1158
In [68]:
data['Total_Amt_Chng_Q4_Q1'].value_counts()
Out[68]:
0.791    36
0.712    34
0.743    34
0.718    33
0.735    33
         ..
1.216     1
1.645     1
1.089     1
2.103     1
0.166     1
Name: Total_Amt_Chng_Q4_Q1, Length: 1158, dtype: int64
In [69]:
sns.boxplot(data=data,x='Total_Amt_Chng_Q4_Q1')
#Boxplot to show the distribution of Total_Amt_Chng_Q4_Q1
Out[69]:
<Axes: xlabel='Total_Amt_Chng_Q4_Q1'>

Total_Trans_Amt¶

In [70]:
data['Total_Trans_Amt'].nunique()
Out[70]:
5033
In [71]:
data['Total_Trans_Amt'].value_counts()
Out[71]:
4253     11
4509     11
4518     10
2229     10
4220      9
         ..
1274      1
4521      1
3231      1
4394      1
10294     1
Name: Total_Trans_Amt, Length: 5033, dtype: int64
In [72]:
sns.boxplot(data=data,x='Total_Trans_Amt')
#Boxplot to show the distribution of Total_Trans_Amt
Out[72]:
<Axes: xlabel='Total_Trans_Amt'>

75% of the customers have a total transaction amount of less than 5000, but there are many outliers on the higher end.

Total_Trans_Ct¶

In [73]:
data['Total_Trans_Ct'].nunique()
Out[73]:
126
In [74]:
data['Total_Trans_Ct'].value_counts()
Out[74]:
81     208
71     203
75     203
69     202
82     202
      ... 
11       2
134      1
139      1
138      1
132      1
Name: Total_Trans_Ct, Length: 126, dtype: int64
In [75]:
sns.boxplot(data=data,x='Total_Trans_Ct')
#Boxplot to show the distribution of Total_Trans_Ct
Out[75]:
<Axes: xlabel='Total_Trans_Ct'>

Total_Ct_Chng_Q4_Q1¶

In [76]:
data['Total_Ct_Chng_Q4_Q1'].nunique()
Out[76]:
830
In [77]:
data['Total_Ct_Chng_Q4_Q1'].value_counts()
Out[77]:
0.667    171
1.000    166
0.500    161
0.750    156
0.600    113
        ... 
0.827      1
0.343      1
1.579      1
0.125      1
0.359      1
Name: Total_Ct_Chng_Q4_Q1, Length: 830, dtype: int64
In [78]:
sns.boxplot(data=data,x='Total_Ct_Chng_Q4_Q1')
#Boxplot to show the distribution of Total_Ct_Chng_Q4_Q1
Out[78]:
<Axes: xlabel='Total_Ct_Chng_Q4_Q1'>

Avg_Utilization_Ratio¶

In [79]:
data['Avg_Utilization_Ratio'].nunique()
Out[79]:
964
In [80]:
data['Avg_Utilization_Ratio'].value_counts()
Out[80]:
0.000    2470
0.073      44
0.057      33
0.048      32
0.060      30
         ... 
0.927       1
0.935       1
0.954       1
0.385       1
0.009       1
Name: Avg_Utilization_Ratio, Length: 964, dtype: int64
In [81]:
sns.boxplot(data=data,x='Avg_Utilization_Ratio')
#Boxplot to show the distribution of Avg_Utilization_Ratio
Out[81]:
<Axes: xlabel='Avg_Utilization_Ratio'>

Multivariate Analysis¶

In [82]:
num_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 7))
sns.heatmap(data[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Credit_Limit and Avg_Open_To_Buy are almost perfectly correlated, which follows from Avg_Open_To_Buy being the credit limit minus the revolving balance. Total_Trans_Amt and Total_Trans_Ct have a strong positive correlation. Months_on_book and Customer_Age also have a relatively strong correlation. There is a negative correlation between Total_Trans_Amt and Attrition_Flag.
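The strongly correlated pairs can also be pulled out of the correlation matrix programmatically; a minimal sketch on synthetic data (the variable names here are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data[num_col]: "b" is built from "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.7].index.tolist()
```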

In [83]:
sns.pairplot(data=data[num_col], diag_kind="kde")
plt.show()

Bivariate Analysis¶

Attrition_Flag vs Customer_Age¶

In [84]:
distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag")

Attrition_Flag vs Gender¶

In [85]:
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag     0     1    All
Gender                           
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------

Both genders have similar attrition rates, with females slightly higher (about 17.4% vs 14.6%).

Attrition_Flag vs Dependent_count¶

In [86]:
stacked_barplot(data, "Dependent_count", "Attrition_Flag")
Attrition_Flag      0     1    All
Dependent_count                   
All              8500  1627  10127
3                2250   482   2732
2                2238   417   2655
1                1569   269   1838
4                1314   260   1574
0                 769   135    904
5                 360    64    424
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Education_Level¶

In [87]:
stacked_barplot(data, "Education_Level", "Attrition_Flag")
Attrition_Flag      0     1   All
Education_Level                  
All              7237  1371  8608
Graduate         2641   487  3128
High School      1707   306  2013
Uneducated       1250   237  1487
College           859   154  1013
Doctorate         356    95   451
Post-Graduate     424    92   516
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Marital_Status¶

In [88]:
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
Attrition_Flag     0     1   All
Marital_Status                  
All             7880  1498  9378
Married         3978   709  4687
Single          3275   668  3943
Divorced         627   121   748
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Income_Category¶

In [89]:
stacked_barplot(data, "Income_Category", "Attrition_Flag")
Attrition_Flag      0     1    All
Income_Category                   
All              8500  1627  10127
Less than $40K   2949   612   3561
$40K - $60K      1519   271   1790
$80K - $120K     1293   242   1535
$60K - $80K      1213   189   1402
abc               925   187   1112
$120K +           601   126    727
------------------------------------------------------------------------------------------------------------------------

About 38% of the attrited customers (612 of 1,627) earn less than $40K.

Attrition_Flag vs Card_Category¶

In [90]:
stacked_barplot(data, "Card_Category", "Attrition_Flag")
Attrition_Flag     0     1    All
Card_Category                    
All             8500  1627  10127
Blue            7917  1519   9436
Silver           473    82    555
Gold              95    21    116
Platinum          15     5     20
------------------------------------------------------------------------------------------------------------------------

Most attrited customers (about 93%) hold Blue cards, but Blue is also by far the most common card overall, so card category alone is not very discriminative.

Attrition_Flag vs Months_on_book¶

In [91]:
distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")

Attrition_Flag vs Total_Relationship_Count¶

In [92]:
stacked_barplot(data, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag               0     1    All
Total_Relationship_Count                   
All                       8500  1627  10127
3                         1905   400   2305
2                          897   346   1243
1                          677   233    910
5                         1664   227   1891
4                         1687   225   1912
6                         1670   196   1866
------------------------------------------------------------------------------------------------------------------------

More than 60% of the attrited customers (979 of 1,627) held 3 or fewer products.

Attrition_Flag vs Months_Inactive_12_mon¶

In [93]:
stacked_barplot(data, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag             0     1    All
Months_Inactive_12_mon                   
All                     8500  1627  10127
3                       3020   826   3846
2                       2777   505   3282
4                        305   130    435
1                       2133   100   2233
5                        146    32    178
6                        105    19    124
0                         14    15     29
------------------------------------------------------------------------------------------------------------------------

826 of the 1,627 attrited customers had been inactive for exactly 3 months. More than 50% of the attrited customers were inactive for at least 3 months, while more than 50% of the existing customers were inactive for 3 months or less.
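Percentages like the ones quoted above can be read off directly with a normalized crosstab instead of dividing counts by hand. A minimal sketch on toy data (the column names mirror this dataset's, the values are illustrative only):

```python
import pandas as pd

# Toy frame mimicking the two columns compared above
toy = pd.DataFrame({
    "Months_Inactive_12_mon": [1, 1, 2, 2, 3, 3, 3, 4],
    "Attrition_Flag":         [0, 0, 0, 1, 0, 1, 1, 1],
})

# normalize="index" turns row counts into within-row proportions
pct = pd.crosstab(
    toy["Months_Inactive_12_mon"], toy["Attrition_Flag"], normalize="index"
)
print(pct)
```

Each row then sums to 1, so the attrition rate per inactivity level can be compared at a glance.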

Attrition_Flag vs Contacts_Count_12_mon¶

In [94]:
stacked_barplot(data, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag            0     1    All
Contacts_Count_12_mon                   
All                    8500  1627  10127
3                      2699   681   3380
2                      2824   403   3227
4                      1077   315   1392
1                      1391   108   1499
5                       117    59    176
6                         0    54     54
0                       392     7    399
------------------------------------------------------------------------------------------------------------------------

Every customer who contacted the bank 6 times in the last 12 months attrited (54 of 54).

Attrition_Flag vs Credit_Limit¶

In [95]:
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")

Attrition_Flag vs Total_Revolving_Bal¶

In [96]:
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")

Most attrited customers had a total revolving balance below $1,500, while more than 50% of the existing customers carry a revolving balance above $1,000.

Attrition_Flag vs Avg_Open_To_Buy¶

In [97]:
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")

The two distributions are similar; Avg_Open_To_Buy shows no clear pattern that separates attrited from existing customers.

Attrition_Flag vs Total_Amt_Chng_Q4_Q1¶

In [98]:
distribution_plot_wrt_target(data, "Total_Amt_Chng_Q4_Q1", "Attrition_Flag")

Existing customers show many outliers on the upper end, while the outliers for attrited customers are more balanced; Total_Amt_Chng_Q4_Q1 is below 1 for most attrited customers.

Attrition_Flag vs Total_Trans_Amt¶

In [99]:
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")

Attrited customers show a lower Total_Trans_Amt distribution than existing customers.

Attrition_Flag vs Total_Trans_Ct¶

In [100]:
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")

Attrition_Flag vs Total_Ct_Chng_Q4_Q1¶

In [101]:
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")

Attrition_Flag vs Avg_Utilization_Ratio¶

In [102]:
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")

Data Pre-processing¶

Outlier detection¶

In [103]:
data2=data.copy()
In [104]:
IQR = data2.quantile(0.75) - data2.quantile(0.25)  # interquartile range
lower_bound = data2.quantile(0.25) - 1.5 * IQR  # establish lower bound
upper_bound = data2.quantile(0.75) + 1.5 * IQR  # establish upper bound
In [105]:
num_cols = data2.select_dtypes(include=["float64", "int64"])
outlier = ((num_cols < lower_bound) | (num_cols > upper_bound)).sum()
outlier / len(data2) * 100
Out[105]:
Attrition_Flag             16.066
Customer_Age                0.020
Dependent_count             0.000
Months_on_book              3.812
Total_Relationship_Count    0.000
Months_Inactive_12_mon      3.268
Contacts_Count_12_mon       6.211
Credit_Limit                9.717
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             9.509
Total_Amt_Chng_Q4_Q1        3.910
Total_Trans_Amt             8.848
Total_Trans_Ct              0.020
Total_Ct_Chng_Q4_Q1         3.891
Avg_Utilization_Ratio       0.000
dtype: float64
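The IQR rule applied above can be checked on a toy series: any value beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR is flagged. A self-contained sketch with the same 1.5 multiplier (the numbers are illustrative):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [100]
```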

Splitting¶

In [106]:
data2["Income_Category"].replace("abc", np.nan, inplace = True)
In [107]:
X = data2.drop(["Attrition_Flag"], axis=1)
y = data2["Attrition_Flag"]
In [108]:
# Splitting data into training, validation and test set:
# first split data into 2 parts

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then split the first set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
In [109]:
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075
Number of rows in validation data = 2026
Number of rows in test data = 2026
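The two-stage split yields 60/20/20 because the second call takes 25% of the remaining 80% (0.25 × 0.8 = 0.2 of the original rows). A self-contained sketch on toy data with the same parameters:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0] * 40 + [1] * 10)  # imbalanced, like the churn labels

# 20% held out for test, stratified so class ratios are preserved
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# 25% of the remaining 80% -> 20% of the original rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

Stratification guarantees each split keeps the 4:1 class ratio, which matters here since only ~16% of customers attrited.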

Missing value imputation¶

In [110]:
data2.isnull().sum()
Out[110]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category             1112
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
In [111]:
fr_imputer = SimpleImputer(strategy = 'most_frequent')
In [112]:
col_missing = ["Education_Level", "Marital_Status", "Income_Category"]
In [113]:
X_train[col_missing] = fr_imputer.fit_transform(X_train[col_missing])
X_val[col_missing] = fr_imputer.transform(X_val[col_missing])
X_test[col_missing] = fr_imputer.transform(X_test[col_missing])
In [114]:
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
In [115]:
cols = X_train.select_dtypes(include=["object", "category"]) ##Training set
for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 40)
F    3193
M    2882
Name: Gender, dtype: int64
****************************************
Graduate         2782
High School      1228
Uneducated        881
College           618
Post-Graduate     312
Doctorate         254
Name: Education_Level, dtype: int64
****************************************
Married     3276
Single      2369
Divorced     430
Name: Marital_Status, dtype: int64
****************************************
Less than $40K    2783
$40K - $60K       1059
$80K - $120K       953
$60K - $80K        831
$120K +            449
Name: Income_Category, dtype: int64
****************************************
Blue        5655
Silver       339
Gold          69
Platinum      12
Name: Card_Category, dtype: int64
****************************************
In [116]:
cols = X_val.select_dtypes(include=["object", "category"]) ##Validation set
for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 40)
F    1095
M     931
Name: Gender, dtype: int64
****************************************
Graduate         917
High School      404
Uneducated       306
College          199
Post-Graduate    101
Doctorate         99
Name: Education_Level, dtype: int64
****************************************
Married     1100
Single       770
Divorced     156
Name: Marital_Status, dtype: int64
****************************************
Less than $40K    957
$40K - $60K       361
$80K - $120K      293
$60K - $80K       279
$120K +           136
Name: Income_Category, dtype: int64
****************************************
Blue        1905
Silver        97
Gold          21
Platinum       3
Name: Card_Category, dtype: int64
****************************************
In [117]:
cols = X_test.select_dtypes(include=["object", "category"]) ##Test data set
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 40)
F    1070
M     956
Name: Gender, dtype: int64
****************************************
Graduate         948
High School      381
Uneducated       300
College          196
Post-Graduate    103
Doctorate         98
Name: Education_Level, dtype: int64
****************************************
Married     1060
Single       804
Divorced     162
Name: Marital_Status, dtype: int64
****************************************
Less than $40K    933
$40K - $60K       370
$60K - $80K       292
$80K - $120K      289
$120K +           142
Name: Income_Category, dtype: int64
****************************************
Blue        1876
Silver       119
Gold          26
Platinum       5
Name: Card_Category, dtype: int64
****************************************
In [118]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 29) (2026, 29) (2026, 29)
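One caveat with calling pd.get_dummies separately on each split: if a rare category (e.g. Platinum, with only 3 validation rows) were absent from one split, the dummy columns would no longer line up. The shapes match here because every category appears in every split, but a reindex guard costs little. A toy sketch of the failure mode and the fix:

```python
import pandas as pd

train = pd.DataFrame({"Card": ["Blue", "Silver", "Gold"]})
val = pd.DataFrame({"Card": ["Blue", "Blue", "Silver"]})  # no "Gold" rows

tr_d = pd.get_dummies(train, drop_first=True)
va_d = pd.get_dummies(val, drop_first=True)
print(tr_d.columns.tolist())  # ['Card_Gold', 'Card_Silver']
print(va_d.columns.tolist())  # ['Card_Silver'] -- misaligned!

# Reindex validation to the training columns, filling missing dummies with 0
va_d = va_d.reindex(columns=tr_d.columns, fill_value=0)
```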
In [119]:
# check the top 5 rows from the train dataset
X_train.head()
Out[119]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender_M Education_Level_Doctorate Education_Level_Graduate Education_Level_High School Education_Level_Post-Graduate Education_Level_Uneducated Marital_Status_Married Marital_Status_Single Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Card_Category_Gold Card_Category_Platinum Card_Category_Silver
800 40 2 21 6 4 3 20056.000 1602 18454.000 0.466 1687 46 0.533 0.080 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0
498 44 1 34 6 2 0 2885.000 1895 990.000 0.387 1366 31 0.632 0.657 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0
4356 48 4 36 5 1 2 6798.000 2517 4281.000 0.873 4327 79 0.881 0.370 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0
407 41 2 36 6 2 0 27000.000 0 27000.000 0.610 1209 39 0.300 0.000 1 0 1 0 0 0 1 0 0 1 0 0 0 0 1
8728 46 4 36 2 2 3 15034.000 1356 13678.000 0.754 7737 84 0.750 0.090 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are attrited customers the model correctly identifies as leaving.
  • False negatives (FN) are attrited customers the model predicts as staying.
  • False positives (FP) are existing customers the model predicts as leaving.

Which metric to optimize?

  • We need a metric that ensures the maximum number of attriting customers are identified correctly by the model.
  • We want to maximize Recall, since the greater the Recall, the fewer the false negatives.
  • False negatives are the costliest error here: if the model predicts a customer will stay when they are actually about to leave, the bank loses that customer's fee income without any chance to intervene.
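On a toy example, recall = TP / (TP + FN), so every missed positive lowers it directly, while precision = TP / (TP + FP) only reacts to false alarms. A minimal sketch with sklearn (labels are illustrative):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 actual positives
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # 2 caught, 2 missed (FN), 1 false alarm (FP)

# recall = TP / (TP + FN) = 2 / 4
print(recall_score(y_true, y_pred))  # 0.5
# precision = TP / (TP + FP) = 2 / 3
print(precision_score(y_true, y_pred))
```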

Let's define a function to output different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.

In [120]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf
In [121]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building with original data¶

Sample code for model building with original data

In [122]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))

print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9774590163934426
Random forest: 1.0
GBM: 0.875
Adaboost: 0.826844262295082
dtree: 1.0

Validation Performance:

Bagging: 0.7699386503067485
Random forest: 0.7484662576687117
GBM: 0.8558282208588958
Adaboost: 0.852760736196319
dtree: 0.7944785276073619
In [123]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_train = recall_score(y_train, model.predict(X_train))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference1 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference1))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9775, Validation Score: 0.7699, Difference: 0.2075
Random forest: Training Score: 1.0000, Validation Score: 0.7485, Difference: 0.2515
GBM: Training Score: 0.8750, Validation Score: 0.8558, Difference: 0.0192
Adaboost: Training Score: 0.8268, Validation Score: 0.8528, Difference: -0.0259
dtree: Training Score: 1.0000, Validation Score: 0.7945, Difference: 0.2055

GBM performed the best with a very small difference of 0.0192 between the training score and validation score. AdaBoost also performed well with a small difference between the validation and training score.

Model Building with Oversampled data¶

In [124]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099 

After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099 

After Oversampling, the shape of train_X: (10198, 29)
After Oversampling, the shape of train_y: (10198,) 
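SMOTE does not simply duplicate minority rows; each synthetic sample is interpolated between a minority point and one of its k nearest minority neighbours. A hedged numpy sketch of that core idea (not imblearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([1.0, 2.0])         # a minority-class sample
neighbor = np.array([3.0, 4.0])  # one of its k nearest minority neighbours

# The synthetic point lies at a random position on the segment between them
gap = rng.random()  # uniform in [0, 1)
synthetic = x + gap * (neighbor - x)
print(synthetic)
```

Because the new points sit between real minority samples, SMOTE can blur the class boundary when minority points are noisy, which is one reason oversampled models may generalize worse than their original-data counterparts.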

In [125]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))

print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9974504804863699
Random forest: 1.0
GBM: 0.980976662090606
Adaboost: 0.9690135320651108
dtree: 1.0

Validation Performance:

Bagging: 0.8496932515337423
Random forest: 0.8680981595092024
GBM: 0.8926380368098159
Adaboost: 0.901840490797546
dtree: 0.8251533742331288
In [126]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_train = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference2 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9975, Validation Score: 0.8497, Difference: 0.1478
Random forest: Training Score: 1.0000, Validation Score: 0.8681, Difference: 0.1319
GBM: Training Score: 0.9810, Validation Score: 0.8926, Difference: 0.0883
Adaboost: Training Score: 0.9690, Validation Score: 0.9018, Difference: 0.0672
dtree: Training Score: 1.0000, Validation Score: 0.8252, Difference: 0.1748

AdaBoost and GBM again performed best; their validation recall improved over the original data, but the gaps between training and validation scores are larger.

Model Building with Undersampled data¶

In [127]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [128]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099 

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976 

After Under Sampling, the shape of train_X: (1952, 29)
After Under Sampling, the shape of train_y: (1952,) 
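Random undersampling simply drops majority rows until both classes have the minority count. A toy sketch of the idea in plain pandas (equivalent in spirit to RandomUnderSampler, not its internals):

```python
import pandas as pd

df = pd.DataFrame({"y": [0] * 8 + [1] * 2, "x": range(10)})

# Keep all minority rows; sample the majority class down to the same size
minority = df[df["y"] == 1]
majority = df[df["y"] == 0].sample(n=len(minority), random_state=1)
balanced = pd.concat([majority, minority])
print(balanced["y"].value_counts().tolist())  # [2, 2]
```

The trade-off is visible in the shapes above: the balanced training set shrinks from 6,075 to 1,952 rows, so the model sees far fewer majority-class examples.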

In [129]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))


print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9907786885245902
Random forest: 1.0
GBM: 0.9805327868852459
Adaboost: 0.9528688524590164
dtree: 1.0

Validation Performance:

Bagging: 0.9294478527607362
Random forest: 0.9386503067484663
GBM: 0.9570552147239264
Adaboost: 0.9601226993865031
dtree: 0.9202453987730062
In [130]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_train = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference3 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9908, Validation Score: 0.9294, Difference: 0.0613
Random forest: Training Score: 1.0000, Validation Score: 0.9387, Difference: 0.0613
GBM: Training Score: 0.9805, Validation Score: 0.9571, Difference: 0.0235
Adaboost: Training Score: 0.9529, Validation Score: 0.9601, Difference: -0.0073
dtree: Training Score: 1.0000, Validation Score: 0.9202, Difference: 0.0798

AdaBoost performed the best. GBM also performed well, with a gap of only 0.0235.

Hyperparameter Tuning¶

Since AdaBoost on the original data, AdaBoost on the undersampled data, and all three GBM models performed best and had the smallest gaps between their training and validation scores, we will tune the hyperparameters of those five models.

AdaBoost with original data¶

In [131]:
%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid ={
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8360596546310832:
CPU times: user 3.46 s, sys: 327 ms, total: 3.78 s
Wall time: 1min 40s
In [132]:
tuned_adb = AdaBoostClassifier(
    random_state=1,
    n_estimators=100,
    learning_rate=0.1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb.fit(X_train, y_train)
Out[132]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
In [133]:
# Check model performance on training set
adb_train = model_performance_classification_sklearn(tuned_adb, X_train, y_train)
adb_train
Out[133]:
Accuracy Recall Precision F1
0 0.982 0.927 0.961 0.944
In [134]:
# Check model performance on validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val
Out[134]:
Accuracy Recall Precision F1
0 0.967 0.856 0.933 0.893

AdaBoost with undersampled data¶

In [135]:
%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid ={
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9467346938775512:
CPU times: user 1.44 s, sys: 92.8 ms, total: 1.54 s
Wall time: 42.6 s
In [136]:
tuned_adb1 = AdaBoostClassifier(
    random_state=1,
    n_estimators=100,
    learning_rate=0.05,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb1.fit(X_train_un, y_train_un)
Out[136]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.05, n_estimators=100, random_state=1)
In [137]:
# Check model performance on training set
adb1_train = model_performance_classification_sklearn(tuned_adb1, X_train_un, y_train_un)
adb1_train
Out[137]:
Accuracy Recall Precision F1
0 0.973 0.978 0.968 0.973
In [138]:
# Check model performance on validation set
adb1_val = model_performance_classification_sklearn(tuned_adb1, X_val, y_val)
adb1_val
Out[138]:
Accuracy Recall Precision F1
0 0.937 0.966 0.731 0.832

GradientBoost with original data¶

In [139]:
%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8104395604395604:
CPU times: user 3.7 s, sys: 361 ms, total: 4.06 s
Wall time: 2min 35s
In [140]:
tuned_gbm = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=100,
    max_features=0.5,
    learning_rate=0.1,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm.fit(X_train, y_train)
Out[140]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
In [141]:
# Check model performance on training set
gbm_train = model_performance_classification_sklearn(
    tuned_gbm, X_train, y_train
)
gbm_train
Out[141]:
Accuracy Recall Precision F1
0 0.972 0.867 0.955 0.909
In [142]:
# Check model performance on validation set
gbm_val = model_performance_classification_sklearn(tuned_gbm, X_val, y_val)
gbm_val
Out[142]:
Accuracy Recall Precision F1
0 0.968 0.862 0.937 0.898

Gradient Boosting with undersampled data¶

In [143]:
%%time

# Defining the model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9508267922553637:
CPU times: user 1.85 s, sys: 156 ms, total: 2.01 s
Wall time: 1min 7s
In [144]:
tuned_gbm1 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=75,
    max_features=0.7,
    learning_rate=0.1,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_un, y_train_un)
Out[144]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
In [145]:
# Check model performance on training set
gbm1_train = model_performance_classification_sklearn(
    tuned_gbm1, X_train_un, y_train_un
)
gbm1_train
Out[145]:
Accuracy Recall Precision F1
0 0.970 0.977 0.964 0.970
In [146]:
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val
Out[146]:
Accuracy Recall Precision F1
0 0.938 0.957 0.738 0.833

Gradient Boosting with oversampled data¶

In [147]:
%%time

# Defining the model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9447041505512901:
CPU times: user 5.63 s, sys: 592 ms, total: 6.22 s
Wall time: 4min 10s
In [148]:
tuned_gbm2 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=75,
    max_features=0.7,
    learning_rate=0.1,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train_over, y_train_over)
Out[148]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
In [149]:
# Check model performance on training set
gbm2_train = model_performance_classification_sklearn(tuned_gbm2, X_train_over, y_train_over)
gbm2_train
Out[149]:
Accuracy Recall Precision F1
0 0.973 0.977 0.968 0.973
In [150]:
# Check model performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val
Out[150]:
Accuracy Recall Precision F1
0 0.952 0.887 0.826 0.855

Sample Parameter Grids¶

Note

  1. Sample parameter grids are provided below for the necessary hyperparameter tuning. They are intended to balance model-performance gains against execution time; one can extend or trim each grid based on execution time and system configuration.
    • Note that extending a grid to improve model performance further will increase the execution time.
  • For Gradient Boosting:
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
  • For Adaboost:
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
  • For Bagging Classifier:
param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
  • For Random Forest:
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [0.3, 0.4, 0.5, "sqrt"],  # flat list of candidates; nesting an array inside the list breaks the search
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
  • For Decision Trees:
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
  • For XGBoost (optional):
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}
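The execution-time note above can be made concrete: `RandomizedSearchCV` fits at most `n_iter * cv` models, so growing a grid only raises the cost until the grid size exceeds `n_iter`. A small helper (hypothetical, not part of the notebook) to estimate the cost of a grid:

```python
import numpy as np

def n_fits(param_grid, n_iter=50, cv=5):
    """Rough number of model fits RandomizedSearchCV runs for a grid:
    n_iter sampled candidates (capped by the grid size) times cv folds."""
    grid_size = int(np.prod([len(v) for v in param_grid.values()]))
    return min(n_iter, grid_size) * cv

# The Gradient Boosting grid above: 2 * 3 * 3 * 2 * 3 = 108 combinations
gbm_grid = {
    "init": ["adaboost", "decision_tree"],   # placeholders for the two estimators
    "n_estimators": np.arange(50, 110, 25),  # 50, 75, 100
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}
print(n_fits(gbm_grid))  # 50 sampled candidates * 5 folds = 250 fits
```

Extending any one list multiplies the grid size, but the search cost stays flat while the grid is larger than `n_iter`; reducing `n_iter` or `cv` is what actually cuts wall time.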

Sample tuning method for Decision tree with original data¶

In [151]:
# Defining the model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.751941391941392:

Sample tuning method for Decision tree with oversampled data¶

In [152]:
# Defining the model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 4} with CV score=0.9111622313302161:

Sample tuning method for Decision tree with undersampled data¶

In [153]:
# Defining the model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.8934432234432235:
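After any of the searches above, the tuned model does not have to be rebuilt by hand: `RandomizedSearchCV` refits the best configuration on the full training data and exposes it as `best_estimator_`. A self-contained sketch of the same tuning pattern on synthetic data (the notebook itself uses `X_train_un`, `y_train_un`, etc.):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

randomized_cv = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=10,
    scoring="recall",
    cv=5,
    random_state=1,
    n_jobs=-1,
)
randomized_cv.fit(X, y)

# Already refit on all of X, y -- ready to evaluate or predict with
best_dt = randomized_cv.best_estimator_
print(randomized_cv.best_params_)
```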

Model Comparison and Final Model Selection¶

In [154]:
# performance comparison for training set

models_train_comp_df = pd.concat(
    [
        gbm_train.T,
        gbm1_train.T,
        gbm2_train.T,
        adb1_train.T,
        adb_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Original data",
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Oversampled data",
    "AdaBoost trained with Undersampled data",
    "AdaBoost trained with Original data",
]
print("Performance comparison for training set:")
models_train_comp_df
Performance comparison for training set:
Out[154]:
Gradient boosting trained with Original data Gradient boosting trained with Undersampled data Gradient boosting trained with Oversampled data AdaBoost trained with Undersampled data AdaBoost trained with Original data
Accuracy 0.972 0.970 0.973 0.973 0.982
Recall 0.867 0.977 0.977 0.978 0.927
Precision 0.955 0.964 0.968 0.968 0.961
F1 0.909 0.970 0.973 0.973 0.944
In [155]:
# performance comparison for validation set

models_val_comp_df = pd.concat(
    [
        gbm_val.T,
        gbm1_val.T,
        gbm2_val.T,
        adb1_val.T,
        adb_val.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Gradient boosting trained with Original data",
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Oversampled data",
    "AdaBoost trained with Undersampled data",
    "AdaBoost trained with Original data",
]
print("Performance comparison for validation set:")
models_val_comp_df
Performance comparison for validation set:
Out[155]:
Gradient boosting trained with Original data Gradient boosting trained with Undersampled data Gradient boosting trained with Oversampled data AdaBoost trained with Undersampled data AdaBoost trained with Original data
Accuracy 0.968 0.938 0.952 0.937 0.967
Recall 0.862 0.957 0.887 0.966 0.856
Precision 0.937 0.738 0.826 0.731 0.933
F1 0.898 0.833 0.855 0.832 0.893

Gradient Boosting trained with the original data gives the best overall balance of recall, precision, and F1 on the validation set, so we select it as the final model and evaluate it on the test set.
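This selection can also be read off programmatically with `idxmax`; a self-contained sketch with the validation numbers above retyped as a small DataFrame:

```python
import pandas as pd

# Validation metrics from the comparison table above (models as rows)
val = pd.DataFrame(
    {
        "Recall": [0.862, 0.957, 0.887, 0.966, 0.856],
        "Precision": [0.937, 0.738, 0.826, 0.731, 0.933],
        "F1": [0.898, 0.833, 0.855, 0.832, 0.893],
    },
    index=[
        "GBM original", "GBM undersampled", "GBM oversampled",
        "AdaBoost undersampled", "AdaBoost original",
    ],
)

# Best model per metric: undersampled models win on recall alone,
# but GBM on the original data leads on precision and F1
print(val.idxmax())
```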

Test set final performance¶

In [156]:
# Let's check the performance on test set
gbm_test = model_performance_classification_sklearn(tuned_gbm, X_test, y_test)
gbm_test
Out[156]:
Accuracy Recall Precision F1
0 0.970 0.874 0.937 0.904
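For interpreting the test-set numbers, it helps to see how recall and precision fall out of the confusion matrix. A toy example (synthetic labels, not the notebook's actual test data), with class 1 playing the role of the attrited customer:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)      # share of attrited customers we caught
precision = tp / (tp + fp)   # share of flagged customers who truly attrited

# Matches sklearn's own metric functions
assert recall == recall_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
print(recall, precision)
```

Since the business cost of missing a churner exceeds the cost of a false alarm, recall on class 1 is the metric the tuning above optimized for.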
In [158]:
feature_names = X_train.columns
importances = tuned_gbm.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Total_Trans_Amt is the most important feature, followed by Total_Trans_Ct and Total_Revolving_Bal.
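As a tabular complement to the bar chart, the importances can be ranked with a pandas Series. The sketch below uses a toy model in place of `tuned_gbm` so it runs standalone:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=1)
feature_names = [f"feat_{i}" for i in range(6)]

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Importances are normalized to sum to 1, so each value reads as a
# share of the total split improvement attributed to that feature
importance = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importance.head(3))
```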

Business Insights and Conclusions¶

Observations¶

  • Customer ages range between 26 and 73.
  • There is a decrease in transaction amount from Q1 to Q4 for most attrited customers.
  • There is a negative correlation between Attrition_Flag and Total_Trans_Amt: attrited customers tend to have lower total transaction amounts than existing customers.
  • There is a strong positive correlation between Total_Trans_Amt and Total_Trans_Ct.
  • Attrited customers tend to have a lower revolving balance.

Insights¶

  • Encourage transactions with cashbacks and other unique incentives.
  • Partner with brands to offer special discounts to increase revolving balance.
  • Help customers discover new products suited to their personal financial goals, since more than 50% of attrited customers held 3 or fewer products.